Text Categorization and Information Retrieval Using WordNet Senses
Authors
Abstract
In this paper we study the influence of semantics on the Text Categorization (TC) and Information Retrieval (IR) tasks. Text categorization was performed with the K Nearest Neighbours (K-NN) method, with each relevant term of a document represented by its corresponding WordNet synset. For the IR task, three techniques were investigated: the direct use of a weighted matrix, the Singular Value Decomposition (SVD) technique in the Latent Semantic Indexing (LSI) model, and the bisecting spherical k-means clustering technique. Taking the semantics of the documents into account improved performance for text categorization, whereas the results were less promising for the IR task.
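As a rough illustration of the categorization setup described in the abstract, the sketch below maps each term to a sense label and classifies a document by majority vote among its k nearest neighbours under cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the `TOY_SYNSETS` table, its sense labels, the `TRAINING` documents, and all function names are hypothetical, and raw term counts stand in for whatever term weighting the paper used.

```python
import math
from collections import Counter

# Hypothetical term -> sense-label table standing in for WordNet synsets.
# The labels are invented for illustration; they are not real WordNet identifiers.
TOY_SYNSETS = {
    "car": "vehicle.n", "automobile": "vehicle.n",
    "engine": "engine.n", "motor": "engine.n",
    "doctor": "doctor.n", "physician": "doctor.n",
    "patient": "patient.n", "clinic": "clinic.n",
}

def to_sense_vector(text):
    """Bag of sense labels; terms without a mapping keep their surface form."""
    tokens = text.lower().split()
    return Counter(TOY_SYNSETS.get(token, token) for token in tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(text, training, k=3):
    """Return the majority label among the k most similar training documents."""
    query = to_sense_vector(text)
    scored = sorted(
        ((cosine(query, to_sense_vector(doc)), label) for doc, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled documents, made up for this example.
TRAINING = [
    ("car engine repair", "autos"),
    ("motor oil for the car", "autos"),
    ("doctor visits patient", "health"),
    ("clinic doctor schedule", "health"),
]

if __name__ == "__main__":
    print(knn_classify("automobile motor", TRAINING))
    print(knn_classify("physician patient", TRAINING))
```

The query "automobile motor" shares no surface words with any training document, so a plain bag-of-words K-NN would find no overlap; after mapping to sense labels it matches the two "autos" documents. This is the kind of effect the sense-based representation is meant to capture.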
Similar resources
Desiderata For Tagging With WordNet Synsets Or MCCA Categories
Minnesota Contextual Content Analysis (MCCA) is a technique for characterizing the concepts and themes occurring in text (sentences, paragraphs, interview transcripts, books). MCCA tags each word with a category and examines the distribution of categories against norms representing general usage of categories. MCCA also scores texts in terms of social contexts that are similar to different func...
Information Retrieval and Text Categorization with Semantic Indexing
In this paper, we present the effect of semantic indexing using WordNet senses on the Information Retrieval (IR) and Text Categorization (TC) tasks. The documents have been sense-tagged using a Word Sense Disambiguation (WSD) system based on Specialized Hidden Markov Models (SHMMs). The preliminary results showed that a small improvement in performance was obtained only in the TC task. ...
Automatic Construction of Persian ICT WordNet using Princeton WordNet
WordNet is a large lexical database of the English language in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is mainly used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...
WordNet and Automated Text Summarization
Proposals for text classification and information retrieval that make use of the WordNet ontology have recently been presented. Generally, this methodology requires statistical induction of synset clusters and entails costly training of specific key domains. The present proposal intends to show that a simple recursive evaluation procedure and WordNet are rich enough to obtain useful results in tex...
ITRI-00-28 What’s in a thesaurus
We first describe four varieties of thesaurus: (1) Roget-style, produced to help people find synonyms when they are writing; (2) WordNet and EuroWordNet; (3) thesauruses produced (manually) to support information retrieval systems; and (4) thesauruses produced automatically from corpora. We then contrast thesauruses and dictionaries, and present a small experiment in which we look at polysemy i...
Publication date: 2004